Introduction

Saltwater intrusion and sea level rise (SWISLR) are complex and intertwined processes that threaten coastal infrastructure and erode biodiverse shoreline habitats. In simpler terms, SWISLR threatens the water quality of coastal communities by flooding sewage infrastructure, leaching toxins into groundwater sources, and contaminating well water.

Research can help scientists better understand how these processes work and can inspire new or improved solutions, equipping frontline communities and stakeholders, especially the most vulnerable, with the support and tools to tackle climate change.

A North Carolina wetland.

Although having every scientist conduct their own fieldwork is both invaluable and rewarding, it is also expensive. That makes it imperative that scientists share their data in accessible ways so that everyone can both contribute to and benefit from open-access data.

Two of the largest repositories of water quality data are the United States Geological Survey’s (USGS) National Water Information System (NWIS) and the Environmental Protection Agency’s (EPA) STOrage and RETrieval (STORET) database. Both contain hundreds of thousands of data points that can inform current discourse on solutions to SWISLR.

The goal of this document is to understand the applications and limitations of the USGS NWIS repository and to evaluate areas where the currently available data could be improved. This file contains the code that was used to analyze the data, along with visualizations that describe it.

The key goals are:

  1. Understand the structure of USGS Water Quality data for specific conductance.
  2. Understand the makeup of the data in terms of various conditions like sample type, parameters tested, and the data quality indicators.
  3. Understand how sites are distributed across the state.
  4. Determine whether the data is consistent with the literature I have read so far.
  5. Outline next steps after this report.

Preparing the R Environment

The HASP Package

# First install the remotes package 
install.packages("remotes") 

# Install HASP package 
remotes::install_gitlab("water/stats/hasp", 
                        host = "code.usgs.gov", 
                        build_opts = c("--no-resave-data","--no-manual"), 
                        build_vignettes = TRUE, 
                        dependencies = TRUE) 

Loading Libraries

library(tidyverse) 
library(remotes)
library(HASP)
library(dataRetrieval)
library(dplyr)
library(ggplot2)
library(sf)
library(olsrr)
library(maps)
library(DataExplorer)
library(knitr)
library(lubridate)
library(dygraphs)
library(xts) 
library(zoo)
library(stats)

USGS Site Location Data

Identifying Parameter Codes

Specific Conductance Parameters Measured in Water

| PCode | Type | Description | Sample Type | Temperature | Units |
| 00094 | Physical | Specific conductance, water, unfiltered, field, microsiemens per centimeter at 25 degrees Celsius | Total | 25 deg C | uS/cm @25C |
| 00095 | Physical | Specific conductance, water, unfiltered, microsiemens per centimeter at 25 degrees Celsius | Total | 25 deg C | uS/cm @25C |
| 00402 | Physical | Specific conductance, non-temperature corrected, water, unfiltered, microsiemens per centimeter | Total | | uS/cm |
| 70386 | Physical | Salinity, water, computed by regression equation from specific conductance, milligrams per liter | Total | | mg/l |
| 72430 | Physical | Specific conductance, water, filtered, field, microsiemens per centimeter at 25 degrees Celsius | Dissolved | 25 deg C | uS/cm @25C |
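
The long descriptions in the table above can also be pulled programmatically with dataRetrieval's readNWISpCode(), which queries USGS parameter metadata. A minimal sketch, assuming internet access to the NWIS web service:

```r
library(dataRetrieval)

# Query USGS metadata for a few of the specific conductance parameter codes; 
# requires internet access to the NWIS web service 
pcode_info <- readNWISpCode(c("00094", "00095", "00402", "70386", "72430"))

# parameter_nm holds the long description shown in the table above 
pcode_info[, c("parameter_cd", "parameter_nm")]
```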

Pulling Site Location Data

# Pulling site location information data in North Carolina 
pCode <- c("00094", "00095", "00402", "70386", "72430", "90094", "90095", 
           "90096", "99974", "99978", "99982")

# Pulling USGS repository data that is in North Carolina and contains the parameters we want 
scNC <- whatNWISdata(stateCd = "NC", 
                     parameterCd = pCode)

Tidying Site Location Data

# Creating new columns with the start and end year only 
scNC <- scNC %>%
  separate(begin_date, into = c("Year_start"), sep = "-", remove = FALSE) %>% 
  separate(end_date, into = c("Year_end"), sep = "-", remove = FALSE) 

# Adding a "period" column and filtering for sites with specific conductance 
# measurements ending after the year 2000 
scNC_1 <- scNC %>% 
  mutate(period = as.Date(end_date) - as.Date(begin_date)) %>% 
  filter(Year_end > 2000) 

Tidy USGS Data

| Station Name | Site Number | Parameter Code | Start Year | End Year |
| NORTHWEST RIVER ABOVE MOUTH NEAR MOYOCK, NC | 02043410 | 00095 | 2006 | 2007 |
| NORTHWEST RIVER ABOVE MOUTH NEAR MOYOCK, NC | 02043410 | 00095 | 2006 | 2007 |
| NORTHWEST RIVER ABOVE MOUTH NEAR MOYOCK, NC | 02043410 | 00095 | 2006 | 2007 |
| NORTHWEST RIVER ABOVE MOUTH NEAR MOYOCK, NC | 02043410 | 00095 | 2006 | 2007 |
| TULL CREEK AT SR 1222 NEAR CURRITUCK, NC | 02043415 | 00095 | 2006 | 2007 |

After tidying, only two parameter codes remain in the data: “00095” and “90095”.

# Check parameters 
unique(scNC_1$parm_cd)
## [1] "00095" "90095"
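
Beyond listing the remaining codes, it helps to know how many records each one has. A minimal dplyr sketch, using a toy stand-in for scNC_1 (whose real contents come from whatNWISdata above):

```r
library(dplyr)

# Toy stand-in for scNC_1; the real data frame comes from whatNWISdata() 
scNC_toy <- data.frame(parm_cd = c("00095", "00095", "90095"))

# Tally records per parameter code instead of just listing the unique codes 
code_counts <- scNC_toy %>% count(parm_cd, name = "n_records")
code_counts
```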

USGS Water Quality Data

Pulling Water Quality Data

# Get specific conductance measurements 
sc_sites <- scNC_1$site_no

sc_data <- readNWISqw(siteNumbers = sc_sites, parameterCd = pCode)

Tidying Water Quality Data

# Tidy data 
sc_data <- sc_data %>%
  separate(sample_dt, into = c("Year_start"), sep = "-", remove = FALSE) %>% 
  filter(Year_start > 2000)
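
Since lubridate is already loaded, an alternative to splitting the date string with separate() is to extract a numeric year directly, which keeps the filter comparison numeric rather than string-based. A self-contained sketch with toy data:

```r
library(dplyr)
library(lubridate)

# Toy stand-in for sc_data; in the real data sample_dt is already a Date 
toy <- data.frame(sample_dt = as.Date(c("1999-06-01", "2005-03-15", "2012-08-15")),
                  result_va = c(120, 450, 101))

# year() returns an integer year, so no string comparison is needed 
toy_recent <- toy %>%
  mutate(Year_start = year(sample_dt)) %>%
  filter(Year_start > 2000)
```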

Exploratory Data Analysis

Data Structure

# View the data structure 
plot_str(sc_data)

Data Information

# Let's take a look at some information about our data frame 
data_structure_1 <- introduce(sc_data)

Information about the NWIS Dataset

| Category | Data |
| Rows | 14903 |
| Columns | 38 |
| Discrete Columns | 32 |
| Continuous Columns | 5 |
| All Missing Columns | 1 |
| Total Missing Values | 244882 |
| Complete Rows | 0 |
| Total Observations | 566314 |
| Memory Usage (bytes) | 5310640 |

Visualizing Data Information

# Putting what we know in a visual format 
data_structure_2 <- plot_intro(sc_data)

Missing Data

# Understanding the data we don't have access to 
data_structure_3 <- plot_missing(sc_data) + 
  labs(title = "Percentage of Observations Missing")

Mapping USGS Data Collection Sites

# Create a map of all the locations where specific conductance in water is measured 
# usa and sf_scNC_1 are assumed to come from an earlier, unshown step: 
usa <- sf::st_as_sf(maps::map("state", fill = TRUE, plot = FALSE))
sf_scNC_1 <- sf::st_as_sf(scNC_1, coords = c("dec_long_va", "dec_lat_va"), crs = 4326)

map_scNC <- ggplot() +
  geom_sf(data = usa[usa$ID == "north carolina", ]) +
  geom_sf(data = sf_scNC_1) + 
  xlab(NULL) +
  ylab(NULL) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) + 
  labs(title = "Sites that Measure Specific Conductance of Water in North Carolina since 2000", 
       caption = "United States Geological Survey, 2024.") 
map_scNC

Understanding Specific Conductance Data

By Parameter

# Parameter code bar graph 
# freq (assumed from an earlier, unshown step): sample counts per parameter code 
freq <- as.data.frame(table(parm_cd = sc_data$parm_cd), responseName = "Frequency")

ggplot(freq, aes(x = Frequency, y = factor(parm_cd))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Samples by Parameter Code", x = "Frequency", y = "Parameter") +
  theme_minimal()

By Sample Date

# Sample date graph 
date_counts <- table(sc_data$sample_dt)

barplot(date_counts, 
        main = "Frequency of Sample Dates",
        xlab = "Date",
        ylab = "Frequency",
        col = "skyblue")

By Sample Time

Scatterplot

plot_n <- ggplot(data = sc_data, aes(x = sample_dt, y = result_va)) + 
  geom_point() + 
  scale_y_log10() + 
  labs(title = "Specific Conductance levels in North Carolina Waters", 
       x = "Year", 
       y = "Specific Conductance in uS/cm at 25 degrees C", 
       caption = "USGS, 2024.") +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = FALSE)
plot_n 

Narrowing Down Coastal Counties

# Import fixed data 
new_sc_counties <- read_csv("scNC_1.csv")
## New names:
## • `` -> `...1`
## Rows: 2076 Columns: 29
## ── Column specification ────────────────────────────────────────────────
## Delimiter: ","
## chr  (15): agency_cd, site_no, station_nm, site_tp_cd, coord_acy_cd, dec_coo...
## dbl  (12): ...1, dec_lat_va, dec_long_va, alt_va, alt_acy_va, ts_id, srs_id,...
## date  (2): begin_date, end_date
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Add new column 
scNC_1$counties <- new_sc_counties$county

# Filtering by CAMA counties 
counties_of_interest <- c("beaufort", "bertie", "brunswick", 
                          "camden", "carteret", "chowan", "craven", "currituck", 
                          "dare", "gates", "hertford", "hyde", "new hanover", 
                          "onslow", "pamlico", "pasquotank", "pender", "perquimans", 
                          "tyrrell", "washington")

scNC_coastal_counties <- subset(scNC_1, counties %in% counties_of_interest) 

# Filter by year 
scNC_coastal_counties <- scNC_coastal_counties %>% 
  filter(Year_start > 2000)

# Pull coastal measurements 
sc_coastal_data <- readNWISqw(siteNumbers = scNC_coastal_counties$site_no,
                         parameterCd = pCode) 

# Filter again by year 
sc_coastal_data <- sc_coastal_data %>% 
  separate(sample_dt, into = c("Year_start"), sep = "-", remove = FALSE) %>% 
  filter(Year_start > 2000)

Tidy USGS Coastal Data

| Agency | Site Number | Sample Date | Sample Time | Parameter Code | Result |
| USGS | 02043750 | 2012-08-15 | 14:38 | 00095 | 101 |
| USGS | 362925076281300 | 2012-08-15 | 13:20 | 00095 | 88 |
| USGS | 362609076295100 | 2012-08-15 | 14:20 | 00095 | 85 |
| USGS | 362827076294900 | 2012-08-15 | 14:10 | 00095 | 93 |
| USGS | 362838076271400 | 2012-08-15 | 13:00 | 00095 | 91 |

Coastal Exploratory Data Analysis

Data Structure

# View the data structure 
plot_str(sc_coastal_data)

Data Information

# Let's take a look at some information about our data frame 
introduce(sc_coastal_data)

Information about the Coastal NWIS Dataset

| Category | Data |
| Rows | 1086 |
| Columns | 37 |
| Discrete Columns | 22 |
| Continuous Columns | 1 |
| All Missing Columns | 14 |
| Total Missing Values | 16858 |
| Complete Rows | 0 |
| Total Observations | 40182 |
| Memory Usage (bytes) | 381024 |

Visualizing Data Information

# Putting what we know in a visual format 
data_structure_2 <- plot_intro(sc_coastal_data)

Missing Data

# Understanding the data we don't have access to 
data_structure_3 <- plot_missing(sc_coastal_data) + 
  labs(title = "Percentage of Observations Missing")

Mapping Location Sites

# Create a map of all the coastal locations where specific conductance in water is measured 
# usa and sf_scNC_coast are assumed to come from an earlier, unshown step: 
usa <- sf::st_as_sf(maps::map("state", fill = TRUE, plot = FALSE))
sf_scNC_coast <- sf::st_as_sf(scNC_coastal_counties, 
                              coords = c("dec_long_va", "dec_lat_va"), crs = 4326)

map_scNC_coast <- ggplot() +
  geom_sf(data = usa[usa$ID == "north carolina", ]) +
  geom_sf(data = sf_scNC_coast) + 
  xlab(NULL) +
  ylab(NULL) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) + 
  labs(title = "Coastal Sites that Measure Specific Conductance of Water in North Carolina since 2000", 
       caption = "United States Geological Survey, 2024.") 
map_scNC_coast

Scatterplot

All parameters

# Plot data 
plot_sc_coastal <- ggplot(data = sc_coastal_data, 
                         aes(x = sample_dt, y = result_va)) + 
  geom_point() + 
  scale_y_log10() + 
  geom_smooth(method = "loess") + 
  labs(title = "Specific Conductance in North Carolina Coastal Waters since 2000", 
       x = "Date", 
       y = "Specific Conductance in uS/cm")
plot_sc_coastal
## `geom_smooth()` using formula = 'y ~ x'

Specific conductance, water, unfiltered

# Get data 
sc_coastal_field <- sc_coastal_data %>% 
  subset(parm_cd == "00095")

# Plot data 
plot_sc_field <- ggplot(data = sc_coastal_field, 
                         aes(x = sample_dt, y = result_va)) + 
  geom_point() + 
  geom_smooth(method = "loess")
plot_sc_field
## `geom_smooth()` using formula = 'y ~ x'

Specific conductance, water, unfiltered, laboratory

# Get data 
sc_coastal_unfiltered <- sc_coastal_data %>% 
  subset(parm_cd == "90095")

# Plot data 
plot_sc_unfiltered <- ggplot(data = sc_coastal_unfiltered, 
                         aes(x = sample_dt, y = result_va)) + 
  geom_point() + 
  geom_smooth(method = "loess")
plot_sc_unfiltered
## `geom_smooth()` using formula = 'y ~ x'

Category Analysis

# correlation_coastal_sc (assumed from an earlier, unshown step): the coastal 
# data frame whose discrete features we want to compare 
correlation_coastal_sc <- sc_coastal_data

plot_correlation(correlation_coastal_sc, type = "d")
## 4 features with more than 20 categories ignored!
## site_no: 261 categories
## sample_dt: 316 categories
## sample_tm: 195 categories
## project_cd: 21 categories

Visualizations

By DQI

Here are the samples grouped by their data quality indicator (DQI), a measure of whether a sample is deemed accurate by hydrologists at the USGS. When a sample is first taken it is marked “presumed satisfactory” (S); only after review by a hydrologist is it considered “reviewed and accepted” (R).

# freq_5 (assumed from an earlier, unshown step): sample counts per DQI code 
freq_5 <- as.data.frame(table(dqi_cd = sc_coastal_data$dqi_cd), responseName = "Frequency")

ggplot(freq_5, aes(x = Frequency, y = factor(dqi_cd))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Samples by Data Quality Indicator", x = "Frequency", y = "DQI") +
  theme_minimal()

Most of the samples were reviewed and accepted by a hydrologist, so any outliers are likely not by mistake.

By Parameter

# freq_7 (assumed from an earlier, unshown step): sample counts per parameter code 
freq_7 <- as.data.frame(table(parm_cd = sc_coastal_data$parm_cd), responseName = "Frequency")

ggplot(freq_7, aes(x = Frequency, y = factor(parm_cd))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Samples by Parameter", x = "Frequency", y = "Parameter") +
  theme_minimal()

By Sample Date

# Sample date graph 
date_counts <- table(sc_coastal_data$sample_dt)

barplot(date_counts, 
        main = "Frequency of Sample Dates",
        xlab = "Date",
        ylab = "Frequency",
        col = "skyblue")

By Sample Time

# Sample time graph 
# time_counts (assumed from an earlier, unshown step): samples per time of day 
time_counts <- table(sc_coastal_data$sample_tm)

barplot(time_counts, 
        main = "Frequency of Samples Taken at Different Times",
        xlab = "Time",
        ylab = "Frequency",
        col = "steelblue",
        las = 2,            # Rotate x-axis labels vertically for readability
        cex.names = 0.8,    # Shrink the axis label text
        names.arg = names(time_counts))  # Use the actual time strings as labels

By Hydrological Event

# freq_3 (assumed from an earlier, unshown step): sample counts per hydrological event code 
freq_3 <- as.data.frame(table(hyd_event_cd = sc_coastal_data$hyd_event_cd), responseName = "Frequency")

ggplot(freq_3, aes(x = Frequency, y = factor(hyd_event_cd))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Samples by Hydrological Event", x = "Frequency", y = "Hydrological Event") +
  theme_minimal()

By Hydrological Condition

# freq_2 (assumed from an earlier, unshown step): sample counts per hydrological condition code 
freq_2 <- as.data.frame(table(hyd_cond_cd = sc_coastal_data$hyd_cond_cd), responseName = "Frequency")

ggplot(freq_2, aes(x = Frequency, y = factor(hyd_cond_cd))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Samples by Hydrological Condition", x = "Frequency", y = "Hydrological Condition") +
  theme_minimal()

Time Series Graph

# Create the xts time-series object that dygraph requires
don <- xts(x = sc_coastal_data$result_va, order.by = sc_coastal_data$sample_dt)

# Build the interactive plot
p <- dygraph(don) %>%
  dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors="#D8AE5A") %>%
  dyRangeSelector() %>%
  dyCrosshair(direction = "vertical") %>%
  dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE)  %>%
  dyRoller(rollPeriod = 1) 
p

The maximum specific conductance value recorded is 53,600 uS/cm at 25 degrees Celsius. This is plausible: seawater has a specific conductance of around 50,000 uS/cm at 25 degrees Celsius.
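
That seawater benchmark also suggests a crude way to estimate salinity from specific conductance: scale linearly so that seawater (about 53,000 uS/cm) maps to about 35 ppt. This scaling is an illustrative assumption, not the USGS regression behind pCode 70386:

```r
# Crude linear rule of thumb (an assumption for illustration, NOT the USGS 
# regression used for pCode 70386): seawater at ~53,000 uS/cm is ~35 ppt 
approx_salinity_ppt <- function(sc_uScm) {
  sc_uScm * (35 / 53000)
}

approx_salinity_ppt(53600)  # the maximum recorded value, approx. 35.4 ppt
approx_salinity_ppt(1500)   # top of the freshwater range, approx. 1 ppt
```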

t-test

# non_coastal_data (assumed from an earlier, unshown step): samples from sites 
# outside the CAMA coastal counties 
non_coastal_data <- subset(sc_data, !(site_no %in% sc_coastal_data$site_no))

group_a <- sc_coastal_data$result_va
group_b <- non_coastal_data$result_va
 
# Perform the independent sample t-test
t_test_result <- t.test(group_a, group_b)
 
# Print the t-test result
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  group_a and group_b
## t = -0.095602, df = 1051, p-value = 0.9239
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -159.8141  144.9648
## sample estimates:
## mean of x mean of y 
##  542.3087  549.7334

With a p-value of 0.92, we fail to reject the null hypothesis: mean specific conductance does not differ significantly between the coastal and non-coastal samples.

Freshwater specific conductance is typically between 0 and 1,500 uS/cm; potable water in the U.S. generally falls between 30 and 1,500 uS/cm.
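
These ranges can be encoded as a quick labeling step. In this sketch the 1,500 uS/cm freshwater cutoff comes from the text above, while the 50,000 uS/cm near-seawater cutoff is an assumed, illustrative threshold:

```r
# Classify samples by specific conductance; the 1,500 uS/cm cutoff is from the 
# text above, and the 50,000 uS/cm (near-seawater) cutoff is an assumption 
classify_sc <- function(sc_uScm) {
  cut(sc_uScm,
      breaks = c(0, 1500, 50000, Inf),
      labels = c("fresh", "brackish", "saline"),
      right = FALSE)
}

classify_sc(c(101, 4200, 53600))  # fresh, brackish, saline
```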

Conclusion

So far, the data I have seen does not match the literature I read while creating the SaltWatch Dashboard. Given the factors examined above, the discrepancy may stem from differing conditions at each site and at the times the samples were taken.

Next steps:

  1. See if the same issues are present in the nitrate data.
  2. Calculate salinity from specific conductance and compare it with either the NC Wetlands database or Spencer’s salinity data.